For the lecture I used the following code to generate data:
n = 50
set.seed(101)
x1 = runif(n)
x2 = runif(n)
prob = ifelse(x1 < 0.5 & x1 > 0.25 & x2 > 0.25 & x2<0.75,0.8,0.3)
y = as.factor(rbinom(n,1,prob))
levels(y) = c("1","2")
df = data.frame(y=y,x1=x1,x2=x2)And then I used the following function to make KNN predictions:
knn_pred = function(point,x1,x2,y,k=5, thresh = 0.5) {
dists = sqrt((x1-point[1])^2+(x2-point[2])^2) #Find all distances to current obs
bound = sort(dists)[k] #Find kth smallest distance
indices = which(dists <= bound) #Find which obs have dists 1:k
outcomes = as.integer(y[indices]) #Find corresponding outcomes y
prob = mean(outcomes)-1
if (prob < thresh){
out = 1
} else {
out = 2
}
out
}This code builds a grid of points, and then makes predictions for each of those points.
grid.fineness = 201
sequence = seq(0,1,length.out=grid.fineness)
grid = expand.grid(sequence,sequence)
colnames(grid) = c("x1","x2")
yhat = apply(grid,1,knn_pred,x1=x1,x2=x2,y=y,k=5)
yhat = as.factor(yhat)With those predictions, we can build a dataframe, and plot.
df = as.data.frame(grid)
df$y = yhat
ggplot(df,aes(x=x1,y=x2,col=y))+geom_point(size=0.08)If we drop the round and subtract 1 in our knn_pred function, we can get probabilities out.
knn_prob = function(point,x1,x2,y,k=5) {
dists = sqrt((x1-point[1])^2+(x2-point[2])^2) #Find all distances to current obs
bound = sort(dists)[k] #Find kth smallest distance
indices = which(dists <= bound) #Find which obs have dists 1:k
outcomes = as.integer(y[indices]) #Find corresponding outcomes y
mean(outcomes)-1 #Taking advantage of 2 outcomes.
}We can predict those probabilities at each point:
phat = apply(grid,1,knn_prob,x1=x1,x2=x2,y=y,k=5)
df$phat = phat
ggplot(df,aes(x=x1,y=x2,col=phat))+geom_point(size=0.08)This is, in essence, the beginnings of a simulation study. We generated data, and we can look at how our predictions perform. We can do this with either the classifications or the underlying probabilities.
We are going to extend this simulation study in a few ways.
Resetting the seed with:
set.seed(101)Run the same data generation code, but with a sample size of 1000. Plot the resulting probabilities when we use \(K=1\), \(K=5\), and \(K=25\).
n = 1000
set.seed(101)
x1 = runif(n)
x2 = runif(n)
prob = ifelse(x1 < 0.5 & x1 > 0.25 & x2 > 0.25 & x2<0.75,0.8,0.3)
y = as.factor(rbinom(n,1,prob))
levels(y) = c("1","2")
bigdf = data.frame(y=y,x1=x1,x2=x2)
grid.fineness = 201
sequence = seq(0,1,length.out=grid.fineness)
grid = expand.grid(sequence,sequence)
colnames(grid) = c("x1","x2")
# yhat = apply(grid,1,knn_pred,x1=x1,x2=x2,y=y,k=1)
# yhat = as.factor(yhat)
bigdf = as.data.frame(grid)
# Create probabilities for each K value
bigdf$phat1 = apply(grid,1,knn_prob,x1=x1,x2=x2,y=y,k=1)
bigdf$phat5 = apply(grid,1,knn_prob,x1=x1,x2=x2,y=y,k=5)
bigdf$phat25 = apply(grid,1,knn_prob,x1=x1,x2=x2,y=y,k=25)ggplot(bigdf,aes(x=x1,y=x2,col=phat1))+geom_point(size=0.08)+
labs(title = 'K=1')ggplot(bigdf,aes(x=x1,y=x2,col=phat5))+geom_point(size=0.08)+
labs(title = 'K=5')ggplot(bigdf,aes(x=x1,y=x2,col=phat25))+geom_point(size=0.08)+
labs(title = 'K=25')Plot the classification predictions when \(K=10\), using a probability threshold of 0.2 for our predictions instead of the standard 0.5.
yhat10 = apply(grid,1,knn_pred,x1=x1,x2=x2,y=y,k=10, thresh = 0.2) %>% as.factor()
bigdf$yhat10 = yhat10
ggplot(df,aes(x=x1,y=x2,col=yhat10))+geom_point(size=0.08) +
labs(title = 'K=10',subtitle = 'Probability threshold of 0.2')Fit an interacted logit to this data. (i.e. model \(Y\sim x1+x2+x1:x2\) – using glm, not a LASSO). Find the predicted probabilities for every point in our grid, and plot those predicted probabilities.
gen_dat <- data.frame(y=y,x1=x1,x2=x2)
mymod <- glm(y ~ x1 + x2 + x1*x2, data = gen_dat, family = 'binomial')
preds <- predict.glm(mymod, grid, type = 'response')
ggplot(df, aes(x=x1,y=x2,col=preds))+ geom_point(size = 0.9)Plot the (in-sample) ROC curves for both the logit model and the KNN with \(K=10\). (hint: I have a function for doing this, given outcomes and probabilities in the lecture)
# Function for plotting ROC Curves
roc <- function(p,y, ...){
y = factor(y) # outcomes
n = length(p)
p = as.vector(p) # probabilities
Q = p > matrix(rep(seq(0,1,length=100),n),ncol=100,byrow=TRUE)
specificity = colMeans(!Q[y==levels(y)[1],])
sensitivity = colMeans(Q[y==levels(y)[2],])
plot(1-specificity, sensitivity, type="l", ...)
abline(a=0,b=1,lty=2,col=8)
paste('Alexandria Uscasio Cortez value (kind of):', sum( (sensitivity - (1-specificity)/100) )) # Riemann sum for aoc
}
borgine <- gen_dat %>%
select(-y)
phat10 <- apply(borgine,1,knn_prob,x1=x1,x2=x2,y=y,k=10)
roc(p=predict(mymod, type='response'), y=y, bty="n")## [1] "Alexandria Uscasio Cortez value (kind of): 36.8884509897397"
roc(p=phat10, y=y, bty = "n")## [1] "Alexandria Uscasio Cortez value (kind of): 48.4470483896598"
get_area <- function(sens, spec){
}Which of these models looks better?
KNN model look better bc
Please complete the surveys posted last week (and in announcements on canvas). They will help me ensure the last 5 weeks of class are as useful to you as possible.
y~x1+x1^2+x2+x2^2+x1:x2)We have a new output with three categories.
# x1,x2,y from your sample with 1k observations need to exist to run this.
set.seed(101)
z = ifelse((x1>0.8 | x2 < 0.4),rbinom(length(y),1,0.8),(y==1)*2)Modify knn_pred so that it predicts the most likely category out of 3 categories (hint: the functions table, which.max, names, and as.integer are how I did this) (hint2: maybe start by building a function that takes a vector and finds the most common element, then fit it into the rest of this). Plot the grid-predictions with this new classifier.